Abstract

We show how to compute the edit distance between two strings of length $n$ up to a factor
of $2^{\tilde O(\sqrt{\log n})}$ in $n^{1+o(1)}$ time. This is the first sub-polynomial approximation algorithm for this
problem that runs in near-linear time, improving on the state-of-the-art $n^{1/3+o(1)}$ approximation.
Previously, approximation of $2^{\tilde O(\sqrt{\log n})}$ was known only for embedding edit distance into $\ell_1$, and
it is not known if that embedding can be computed in less than quadratic time.



1 Introduction 

The edit distance (or Levenshtein distance) between two strings is the minimum number of insertions,
deletions, and substitutions needed to transform one string into the other [Lev65]. This distance
is of fundamental importance in several fields such as computational biology and text
processing/searching, and consequently, problems involving edit distance were studied extensively
(see [Nav01], [Gus97], and references therein). In computational biology, for instance, edit distance and its slight
variants are the most elementary measures of dissimilarity for genomic data, and thus improvements
on edit distance algorithms have the potential of major impact.

The basic problem is to compute the edit distance between two strings of length $n$ over some
alphabet. The text-book dynamic programming runs in $O(n^2)$ time (see [CLRS01] and references
therein). This was only slightly improved by Masek and Paterson [MP80] to $O(n^2/\log^2 n)$ time for
constant-size alphabets^1. Their result from 1980 remains the best algorithm to this date.
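
For reference, the quadratic-time baseline can be sketched in a few lines. This is the standard textbook dynamic program, not the algorithm of this paper; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook O(|x| * |y|) dynamic program for edit distance.

    prev[j] holds the edit distance between the already processed
    prefix of x and the first j characters of y.
    """
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]  # deleting all i characters of x
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free on a match)
            ))
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept, so the space is linear while the time stays quadratic, which is exactly the cost the approximation algorithms discussed below try to avoid.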

Since near-quadratic time is too costly when working on large datasets, practitioners tend to
rely on faster heuristics (see [Gus97], [Nav01]). This leads to the question of finding fast algorithms
with provable guarantees, specifically: can one approximate the edit distance between two strings
in near-linear time [Ind01, BEK+03, BJKK04, BES06, CPSV00, Cor03, OR07, KN06, KR06]?



*A preliminary version of this paper appeared in Proceedings of the 41st Annual ACM Symposium on Theory of 
Computing (STOC 2009), Bethesda, MD, USA, 2009, pp. 199-204. 

†This work was done when the author was at Massachusetts Institute of Technology, while supported in part by
David and Lucille Packard Fellowship and by MADALGO (Center for Massive Data Algorithmics, funded by the
Danish National Research Association) and by NSF grant CCF-0728645.

‡Supported in part by a Symantec research fellowship, NSF grant 0728645, and NSF grant 0732334. This work
was done when the author was a graduate student at Massachusetts Institute of Technology.

^1 The result has been only recently extended to arbitrarily large alphabets by Bille and Farach-Colton [BFC08]
with a $O(\log\log n)^2$ factor loss in time.
Prior results on approximate algorithms^2. A linear-time $\sqrt{n}$-approximation algorithm
immediately follows from the $O(n + d^2)$-time exact algorithm (see Landau, Myers, and Schmidt
[LMS98]), where $d$ is the edit distance between the input strings. Subsequent research improved
the approximation first to $n^{3/7}$, and then to $n^{1/3+o(1)}$, due to, respectively, Bar-Yossef, Jayram,
Krauthgamer, and Kumar [BJKK04], and Batu, Ergün, and Sahinalp [BES06].

A sublinear time algorithm was obtained by Batu, Ergün, Kilian, Magen, Raskhodnikova,
Rubinfeld, and Sami [BEK+03]. Their algorithm distinguishes the cases when the distance is $O(n^{1-\epsilon})$
vs. $\Omega(n)$ in $\tilde O(n^{1-2\epsilon} + n^{(1-\epsilon)/2})$ time for any $\epsilon > 0$. Note that their algorithm cannot distinguish
distances, say, $O(n^{0.1})$ vs. $\Omega(n^{0.9})$.

On a related front, in 2005, the breakthrough result of Ostrovsky and Rabani gave an embedding
of the edit distance metric into $\ell_1$ with $2^{\tilde O(\sqrt{\log n})}$ distortion [OR07] (see preliminaries for definitions).
This result vastly improved related applications, namely nearest neighbor search and sketching.
However, it did not have implications for computing edit distance between two strings in
sub-quadratic time. In particular, to the best of our knowledge it is not known whether it is possible
to compute their embedding in less than quadratic time.



The best approximation to this date remains the 2006 result of Batu, Ergün, and Sahinalp [BES06],
achieving $n^{1/3+o(1)}$ approximation. Even for $n^{2-\epsilon}$ time, their approximation is $n^{\epsilon/3+o(1)}$.



Our result. We obtain $2^{\tilde O(\sqrt{\log n})}$ approximation in near-linear time. This is the first
sub-polynomial approximation algorithm for computing the edit distance between two strings running
in strongly subquadratic time.

Theorem 1.1. The edit distance between two strings $x, y \in \{0,1\}^n$ can be computed up to a factor
of $2^{O(\sqrt{\log n \log\log n})}$ in $n \cdot 2^{O(\sqrt{\log n \log\log n})}$ time.

Our result immediately extends to two more related applications. The first application is to
sublinear-time algorithms. In this scenario, the goal is to compute the distance between two strings
$x, y$ of the same length $n$ in $o(n)$ time. For this problem, for any $\alpha < \beta \le 1$, we can distinguish
distance $O(n^{\alpha})$ from distance $\Omega(n^{\beta})$ in $O(n^{\alpha + 2(1-\beta) + o(1)})$ time.

The second application is to the problem of pattern matching with errors. In this application,
one is given a text $T$ of length $N$ and a pattern $P$ of length $n$, and the goal is to report the
substring of $T$ that minimizes the edit distance to $P$. Our result immediately gives an algorithm
for this problem running in $O(N \log N) \cdot 2^{\tilde O(\sqrt{\log n})}$ time with $2^{\tilde O(\sqrt{\log n})}$ approximation. We note that
the best exact algorithm for this problem runs in time $O(Nn/\log^2 n)$ [MP80]. Better algorithms
may be obtained if we restrict the minimal distance between the pattern and best substring of
$T$ or for relatives of the edit distance. In particular, Sahinalp and Vishkin [SV96] and Cole and
Hariharan [CH02] showed linear-time algorithms for finding all substrings at distance at most $n^c$,
where $c$ is a constant in $(0,1)$. Moreover, Cormode and Muthukrishnan gave a near-linear time
$\tilde O(\log n)$-approximation algorithm when the distance is the edit distance with moves.



1.1 Preliminaries and Notation 

Before describing our general approach and the techniques used, we first introduce a few definitions. 



^2 We make no attempt at presenting a complete list of results for restricted problems, such as average case edit
distance, weakly-repetitive strings, bounded distance regime, or related problems, such as pattern matching/nearest
neighbor, sketching. However, for a very thorough survey, if only slightly outdated, see [Nav01].
^3 We use $\tilde O(f(n))$ to denote $f(n) \cdot \log^{O(1)} f(n)$.
We write $\mathrm{ed}(x, y)$ to denote the edit distance between strings $x$ and $y$. We use the notation
$[n] = \{1, 2, 3, \dots, n\}$. For a string $x$, a substring starting at $i$, of length $m$, is denoted $x[i : i + m - 1]$.
Whenever we say with high probability (w.h.p.) throughout the paper, we mean "with probability
$1 - 1/p(n)$", where $p(n)$ is a sufficiently large polynomial function of the input size $n$.

Embeddings. For a metric $(M, d_M)$, and another metric $(X, \rho)$, an embedding is a map $\phi : M \to X$
such that, for all $x, y \in M$, we have $d_M(x, y) \le \rho(\phi(x), \phi(y)) \le \gamma \cdot d_M(x, y)$, where $\gamma \ge 1$ is the
distortion of the embedding. In particular, all embeddings in this paper are non-contracting.

We say embedding $\phi$ is oblivious if for any subset $S \subset M$ of size $n$, the distortion guarantee
holds for all pairs $x, y \in S$ with high probability. The embedding $\phi$ is non-oblivious if it holds for
a specific set $S$ (i.e., $\phi$ is allowed to depend on $S$).

Metrics. The $k$-dimensional $\ell_1$ metric is the set of points living in $\mathbb{R}^k$ under the distance
$\|x - y\|_1 = \sum_{i=1}^{k} |x_i - y_i|$. We also denote it by $\ell_1^k$.

We define thresholded Earth-Mover Distance, denoted TEMD$_t$ for a fixed threshold $t > 0$, as
the following distance on subsets $A$ and $B$ of size $s \in \mathbb{N}$ of some metric $(M, d_M)$:

$$\mathrm{TEMD}_t(A, B) = \frac{1}{s} \min_{\tau : A \to B} \sum_{a \in A} \min\{ d_M(a, \tau(a)),\ t \} \qquad (1)$$

where $\tau$ ranges over all bijections between sets $A$ and $B$. TEMD$_\infty$ is the simple Earth-Mover
Distance (EMD). We will always use $t = s$ and thus drop the subscript $t$; i.e., TEMD = TEMD$_s$.
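
The definition can be checked on tiny inputs by brute force. The sketch below is illustrative only: it assumes the normalized form $\mathrm{TEMD}_t(A,B) = \frac{1}{s}\min_\tau \sum_{a \in A} \min\{d_M(a,\tau(a)), t\}$ with $\ell_1$ as the ground metric, enumerates all $s!$ bijections, and uses hypothetical names (`l1`, `temd`).

```python
from itertools import permutations


def l1(a, b):
    """l_1 distance between two equal-length tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))


def temd(A, B, t=None):
    """Brute-force thresholded EMD between two equal-size lists of points.

    Assumes the 1/s normalization; t defaults to s = |A|, matching the
    convention TEMD = TEMD_s. Pass t=float("inf") for plain EMD.
    """
    s = len(A)
    assert s > 0 and len(B) == s
    if t is None:
        t = s
    best = min(
        sum(min(l1(a, B[perm[i]]), t) for i, a in enumerate(A))
        for perm in permutations(range(s))  # every bijection A -> B
    )
    return best / s
```

The factorial running time makes this usable only as a correctness oracle for very small sets.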

A graph (tree) metric is a metric induced by a connected weighted graph (tree) G, where the 
distance between two vertices is the length of the shortest path between them. We denote an 
arbitrary tree metric by TM. 

Semimetric spaces. We define a semimetric to be a pair $(M, d_M)$ that satisfies all the properties
of a metric space except the triangle inequality. A $\gamma$-near metric is a semimetric $(M, d_M)$ such that
there exists some metric $(M, d^*_M)$ (satisfying the triangle inequality) with the property that, for
any $x, y \in M$, we have that $d^*_M(x, y) \le d_M(x, y) \le \gamma \cdot d^*_M(x, y)$.

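Product spaces. A sum-product over a metric $\mathcal{M} = (M, d_M)$, denoted $\bigoplus_{\ell_1}^{k} \mathcal{M}$, is a derived
metric over the set $M^k$, where the distance between two points $x = (x_1, \dots, x_k)$ and $y = (y_1, \dots, y_k)$
is equal to

$$d_{1,\mathcal{M}}(x, y) = \sum_{i \in [k]} d_M(x_i, y_i).$$

For example the space $\bigoplus_{\ell_1}^{k} \mathbb{R}$ is just the $k$-dimensional $\ell_1$.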

Analogously, a min-product over $\mathcal{M} = (M, d_M)$, denoted $\bigoplus_{\min}^{k} \mathcal{M}$, is a semimetric over $M^k$
where the distance between two points $x = (x_1, \dots, x_k)$ and $y = (y_1, \dots, y_k)$ is

$$d_{\min,\mathcal{M}}(x, y) = \min_{i \in [k]} \{ d_M(x_i, y_i) \}.$$

We also slightly abuse the notation by writing $\bigoplus_{\min}^{k} \mathrm{TM}$ to denote the min-product of $k$ tree
metrics (that could differ from each other).
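
The two product constructions differ in a way that is easy to see on a toy example. The following sketch (function names are hypothetical, and the ground metric is simply the absolute difference on the reals) computes both distances and exhibits three points for which the min-product violates the triangle inequality, which is why it is only a semimetric.

```python
def sum_product(d, x, y):
    """Sum-product distance: add up the coordinate-wise ground distances."""
    return sum(d(a, b) for a, b in zip(x, y))


def min_product(d, x, y):
    """Min-product distance: take the smallest coordinate-wise ground distance."""
    return min(d(a, b) for a, b in zip(x, y))


def abs_diff(a, b):
    """Ground metric used for the illustration: |a - b| on the reals."""
    return abs(a - b)


# Three points in R^2 viewed as elements of a product of two copies of R.
x, y, z = (0, 0), (0, 5), (5, 5)

# x and y agree in the first coordinate, y and z agree in the second,
# but x and z differ by 5 in both coordinates.
violates_triangle = (
    min_product(abs_diff, x, z)
    > min_product(abs_diff, x, y) + min_product(abs_diff, y, z)
)
```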
1.2 Techniques



Our starting point is the Ostrovsky-Rabani embedding [OR07]. For strings $x, y$, as well as for all
substrings $\sigma$ of specific lengths, we compute some vectors $v_\sigma$ living in low-dimensional $\ell_1$ such that
the distance between two such vectors approximates the edit distance between the associated
(sub-)strings. In this respect, these vectors can be seen as an embedding of the considered strings into
$\ell_1$ of polylogarithmic dimension. Unlike the Ostrovsky-Rabani embedding, however, our embedding
is non-oblivious in the sense that the vectors $v_\sigma$ are computed given all the relevant strings $\sigma$. In
contrast, Ostrovsky and Rabani give an oblivious embedding $\phi_n : \{0,1\}^n \to \ell_1$ such that
$\|\phi_n(x) - \phi_n(y)\|_1$ approximates $\mathrm{ed}(x, y)$. However, the obliviousness comes at a high price: their embedding
requires a high dimension, of order $\Omega(n)$, and a high computation time, of order $\Omega(n^2)$ (even
when allowing randomized embedding, and a constant probability of correctness). We further
note that reducing the dimension of this embedding seems unlikely, as suggested by the results on
impossibility of dimensionality reduction within $\ell_1$ [CS02, BC03, LN04]. Nevertheless, the general
recursive approach of the Ostrovsky-Rabani embedding is the starting point of the algorithm from
this paper.

The heart of our algorithm is a near-linear time algorithm that, given a sequence of
low-dimensional vectors $v_1, \dots, v_n \in \ell_1$ and an integer $s < n$, constructs new vectors $q_1, \dots, q_m \in \ell_1^{O(\log^2 n)}$,
where $m = n - s + 1$, with the following property. For all $i, j \in [m]$, the value $\|q_i - q_j\|_1$
approximates the Earth-Mover Distance (EMD)^4 between the sets $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$ and
$A_j = \{v_j, v_{j+1}, \dots, v_{j+s-1}\}$. To accomplish this (non-oblivious) embedding, we proceed in two
stages. First, we embed (obliviously) the EMD metric into a min-product of $\ell_1$'s of low dimension.
In other words, for a set $A$, we associate a matrix $L(A)$, of polylogarithmic size, such that the EMD
distance between sets $A$ and $B$ is approximated by $\min_r \sum_t |L(A)_{rt} - L(B)_{rt}|$. Min-products help
us simultaneously on two fronts: one is that we can apply a weak dimensionality reduction in $\ell_1$,
using the Cauchy projections, and the second one enables us to accomplish a low-dimensional EMD
embedding itself. Our embedding $L(\cdot)$ is not only low-dimensional, but it is also linear, allowing us
to compute matrices $L(A_i)$ in near-linear time by performing one pass over the sequence $v_1, \dots, v_n$.
Linearity is crucial here as even the total size of $A_i$'s is $\sum_i |A_i| = (n - s + 1) \cdot s$, which can be as
high as $\Omega(n^2)$, and so processing each $A_i$ separately is infeasible.

In the second stage, we show how to embed a set of $n$ points lying in a low-dimensional
min-product of $\ell_1$'s back into a low-dimensional $\ell_1$ with only small distortion. We note that this is
not possible in general, with any bounded distortion, because such a set of points does not even
form a metric. We show that this is possible when we assume that the semi-metric induced by
the set of points approximates some metric (in our case, the set of points approximates the initial
EMD metric). The embedding from this stage starts by embedding a min-product of $\ell_1$'s into a
low-dimensional min-product of tree metrics. We further embed the latter into an $n$-point metric
supported by the shortest-path metric of a sparse graph. Finally, we observe that we can implement
Bourgain's embedding on a sparse graph metric in near-linear time. These last two steps make our
embedding non-oblivious.

1.3 Recent Work 

We note that the recent work [AKO10] has shown that one can approximate the edit distance
between two strings up to a multiplicative factor of $(\log n)^{O(1/\epsilon)}$ in $n^{1+\epsilon}$ time, for any desired
$\epsilon > 0$. Although the new result obtains polylogarithmic approximation, the running time is slightly
higher than the algorithm presented here. For a comparable approximation, obtained for
$\epsilon = \sqrt{\log\log n / \log n}$, the algorithm of [AKO10] does not improve the running time (up to constants
hidden by the big O notation). We further remark that the techniques of [AKO10] are disjoint
from the techniques presented here, and are based on asymmetric sampling of one of the strings.

^4 In fact, our algorithm does this for thresholded EMD, TEMD, but the technique is precisely the same.



2 Short Overview of the Ostrovsky-Rabani Embedding 

We now briefly describe the embedding of Ostrovsky and Rabani [OR07]. Some notions introduced
here are used in our algorithm described in the next section.

The embedding of Ostrovsky and Rabani is recursive. For a fixed $n$, they construct the
embedding of edit distance over strings of length $n$ using the embedding of edit distance over
strings of shorter lengths $l \le n/2^{\sqrt{\log n \log\log n}}$. We denote their embedding of length-$n$ strings by
$\phi_n : \{0,1\}^n \to \ell_1$, and let $d^{OR}_n$ be the resulting distance: $d^{OR}_n(x, y) = \|\phi_n(x) - \phi_n(y)\|_1$. For
two strings $x, y \in \{0,1\}^n$, the embedding is such that $d^{OR}_n(x, y) = \|\phi_n(x) - \phi_n(y)\|_1$ approximates an
"idealized" distance $d^*_n(x, y)$, which itself approximates the edit distance between $x$ and $y$.

Before describing the "idealized" distance $d^*_n$, we introduce some notation. Partition $x$ into
$b = 2^{\sqrt{\log n \log\log n}}$ blocks called $x^{(1)}, \dots, x^{(b)}$ of length $l = n/b$. Next, fix some $j \in [b]$ and $s \le l$. We
consider the set of all substrings of $x^{(j)}$ of length $l - s + 1$, embed each one recursively via $\phi_{l-s+1}$,
and define $S^s_j(x) \subset \ell_1$ to be the set of resulting vectors (note that $|S^s_j| = s$). Formally,

$$S^s_j(x) = \{ \phi_{l-s+1}(x[(j-1)l + z : (j-1)l + z + l - s]) \mid z \in [s] \}.$$

Taking $\phi_{l-s+1}$ as given (and thus also the sets $S^s_j(x)$ for all $x$), define the new "idealized" distance
$d^*_n$ approximating the edit distance between strings $x, y \in \{0,1\}^n$ as

$$d^*_n(x, y) = c \sum_{j=1}^{b} \ \sum_{\substack{f \in \mathbb{N} \\ s = 2^f \le l}} \mathrm{TEMD}(S^s_j(x), S^s_j(y)) \qquad (2)$$

where TEMD is the thresholded Earth-Mover Distance (defined in Equation (1)), and $c$ is a
sufficiently large normalization constant ($c \ge 12$ suffices). Using the terminology from the preliminaries,
the distance function $d^*_n$ can be viewed as the distance function of the sum-product of TEMDs,
i.e., $\bigoplus_{\ell_1}^{b} \bigoplus_{\ell_1}^{O(\log n)} \mathrm{TEMD}$, and the embedding into this product space is attained by the natural
identity map (on sets $S^s_j$).

The key idea is that the distance $d^*_n(x, y)$ approximates edit distance well, assuming that $\phi_{l-s+1}$
approximates edit distance well, for all $s = 2^f$ where $f \in \{1, 2, \dots, \lfloor \log_2 l \rfloor\}$. Formally, Ostrovsky
and Rabani show that:

Fact 2.1 ([OR07]). Fix $n$ and $b < n$, and let $l = n/b$. Let $D_{n/b}$ be an upper bound on distortion of
$\phi_{l-s+1}$ viewed as an embedding of edit distance on strings $\{x[i : i+l-s],\ y[i : i+l-s] \mid i \in [n-l+s]\}$,
for all $s = 2^f$ where $f \in \{1, 2, \dots, \lfloor \log_2 l \rfloor\}$. Then,

$$\mathrm{ed}(x, y) \le d^*_n(x, y) \le \mathrm{ed}(x, y) \cdot (D_{n/b} + b) \cdot O(\log n).$$



To obtain a complete embedding, it remains to construct an embedding approximating $d^*_n$ up to
a small factor. In fact, if one manages to approximate $d^*_n$ up to a poly-logarithmic factor, then the
final distortion comes out to be $2^{O(\sqrt{\log n \log\log n})}$. This follows from the following recurrence on the
distortion factor $D_n$. Suppose $\phi_n$ is an embedding that approximates $d^*_n$ up to a factor $\log^{O(1)} n$.
Then, if $D_n$ is the distortion of $\phi_n$ (as an embedding of edit distance), then Fact 2.1 immediately
implies that, for $b = 2^{\sqrt{\log n \log\log n}}$,

$$D_n \le \log^{O(1)} n \cdot \left( D_{n/b} + 2^{\sqrt{\log n \log\log n}} \right).$$

This recurrence solves to $D_n \le 2^{O(\sqrt{\log n \log\log n})}$ as proven in [OR07].
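
For intuition, the recurrence $D_n \le \log^{O(1)} n \cdot (D_{n/b} + b)$ can be unrolled by hand. The derivation below is a sketch and is not taken from [OR07]; it assumes a fixed polylogarithmic factor $L = \log^{C} n \ge 2$ (logarithms base 2), the same additive term $b = 2^{\sqrt{\log n \log\log n}}$ at every level, and a base-case distortion $D_0$. All three are simplifying assumptions introduced here.

```latex
\begin{align*}
D_n &\le L\,(D_{n/b} + b)
     \le L^{2} D_{n/b^{2}} + (L^{2} + L)\,b
     \le \dots
     \le L^{k} D_0 + b \sum_{i=1}^{k} L^{i}
     \le L^{k}\,(D_0 + 2b), \\
k &= \log_b n = \sqrt{\log n / \log\log n}, \\
L^{k} &= 2^{\,C \log\log n \cdot \sqrt{\log n / \log\log n}}
       = 2^{\,C \sqrt{\log n \log\log n}} .
\end{align*}
```

The last inequality in the first line uses $\sum_{i=1}^{k} L^{i} \le 2L^{k}$ for $L \ge 2$. Hence $D_n \le 2^{O(\sqrt{\log n \log\log n})}$ whenever $D_0$ is itself bounded by $2^{O(\sqrt{\log n \log\log n})}$.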

Concluding, to complete a step of the recursion, it is sufficient to embed the metric given
by $d^*_n$ into $\ell_1$ with a polylogarithmic distortion. Recall that $d^*_n$ is the distance of the metric
$\bigoplus_{\ell_1}^{b} \bigoplus_{\ell_1}^{O(\log n)} \mathrm{TEMD}$, and thus, one just needs to embed TEMD into $\ell_1$. Indeed, Ostrovsky and
Rabani show how to embed a relaxed (but sufficient) version of TEMD into $\ell_1$ with $O(\log n)$
distortion, yielding the desired embedding $\phi_n$, which approximates $d^*_n$ up to a $O(\log n)$ factor at
each level of recursion. We note that the required dimension is $\tilde O(n)$.



3 Proof of the Main Theorem 

We now describe our general approach. Fix $x \in \{0,1\}^n$. For each substring $\sigma$ of $x$, we construct a
low-dimensional vector $v_\sigma$ such that, for any two substrings $\sigma, \tau$ of the same length, the edit distance
between $\sigma$ and $\tau$ is approximated by the $\ell_1$ distance between the vectors $v_\sigma$ and $v_\tau$. We note that
the embedding is non-oblivious: to construct vectors $v_\sigma$ we need to know all the substrings of $x$
in advance (akin to Bourgain's embedding guarantee). We also note that computing such vectors
is enough to solve the problem of approximating the edit distance between two strings, $x$ and $y$.
Specifically, we apply this procedure to the string $x' = x \circ y$, the concatenation of $x$ and $y$, and
then compute the $\ell_1$ distance between the vectors corresponding to $x$ and $y$, substrings of $x'$.

More precisely, for each length $m \in W$, for some set $W \subset [n]$ specified later, and for each
substring $x[i : i + m - 1]$, where $i = 1, \dots, n - m + 1$, we compute a vector $v^{(m)}_i$ in $\ell_1^{\alpha}$, where
$\alpha = 2^{\tilde O(\sqrt{\log n})}$. The construction is inductive: to compute vectors $v^{(m)}_i$, we use vectors $v^{(l)}_i$ for
$l \ll m$ and $l \in W$. The general approach of our construction is based on the analysis of the
recursive step of Ostrovsky and Rabani, described in Section 2. In particular, our vectors $v^{(m)}_i \in \ell_1$
will also approximate the $d^*_m$ distance (given in Equation (2)) with sets $S^s_i$ defined using vectors
$v^{(l)}_i$ with $l \ll m$.

The main challenge is to process one level (vectors $v^{(m)}_i$ for a fixed $m$) in near-linear time. Besides
the computation time itself, a fundamental difficulty in applying the approach of Ostrovsky and
Rabani directly is that their embedding would give a much higher dimension $\alpha$, proportional to
$\tilde O(m)$. Thus, if we were to use their embedding, even storing all the vectors would take quadratic
space.

To overcome this last difficulty, we settle on non-obliviously embedding the set of substrings
$x[i : i + m - 1]$ for $i \in [n - m + 1]$ under the "ideal" distance $d^*_m$ with $\log^{O(1)} n$ distortion (formally,
under the distance $d^*_m$ from Equation (2), when $S^s_j(x[i : i + m - 1]) = \{ v^{(l-s+1)}_{i+(j-1)l+z-1} \mid z \in [s] \}$ for
$l = m/2^{\sqrt{\log n \log\log n}}$). Existentially, we know that there exist vectors $w^{(m)}_i \in \mathbb{R}^{O(\log^2 n)}$ such that
$\|w^{(m)}_i - w^{(m)}_j\|_1$ approximates $d^*_m(x[i : i + m - 1], x[j : j + m - 1])$ for all $i$ and $j$; this follows by
the standard Bourgain's embedding [Bou85]. The vectors $v^{(m)}_i$ that we compute approximate the
properties of the ideal vectors $w^{(m)}_i$. Their efficient computability comes at the cost of an additional
polylogarithmic loss in approximation.

The main building block is the following theorem. It shows how to approximate the TEMD
distance for the desired sets $S^s_j$.

Theorem 3.1. Let $n \in \mathbb{N}$ and $s \in [n]$. Let $v_1, \dots, v_n$ be vectors in $\{-M, \dots, M\}^{\alpha}$, where $M = n^{O(1)}$
and $\alpha \le n$. Define sets $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$ for $i \in [n - s + 1]$.

Let $t = O(\log^2 n)$. We can compute (randomized) vectors $q_i \in \ell_1^{t}$ for $i \in [n - s + 1]$ such that
for any $i, j \in [n - s + 1]$, with high probability, we have

$$\mathrm{TEMD}(A_i, A_j) \le \|q_i - q_j\|_1 \le \mathrm{TEMD}(A_i, A_j) \cdot \log^{O(1)} n.$$

Furthermore, computing all vectors $q_i$ takes $\tilde O(n\alpha)$ time.

To map the statement of this theorem to the above description, we mention that, for each $l = m/b$
for $m \in W$, we apply the theorem to vectors $\left(v^{(l-s+1)}_i\right)_{i \in [n-l+s]}$ for each $s = 1, 2, 4, 8, \dots, 2^{\lfloor \log_2 l \rfloor}$.

We prove Theorem 3.1 in later sections. Once we have Theorem 3.1, it becomes relatively
straightforward (albeit a bit technical) to prove the main theorem, Theorem 1.1. We complete the
proof of Theorem 1.1 next, assuming Theorem 3.1.



Proof of Theorem 1.1. We start by appending $y$ to the end of $x$; we will work with the new version of
$x$ only. Let $b = 2^{\sqrt{\log n \log\log n}}$ and $\alpha = O(b \log^2 n)$. We construct vectors $v^{(m)}_i \in \ell_1^{\alpha}$ for $m \in W$,
where $W \subset [n]$ is a carefully chosen set of size $2^{O(\sqrt{\log n \log\log n})}$. Namely, $W$ is the minimal set
such that: $n \in W$, and, for each $i \in W$ with $i \ge b$, we have that $i/b - 2^j + 1 \in W$ for all integers
$j \le \lfloor \log_2 i/b \rfloor$. It is easy to show by induction that the size of $W$ is $2^{O(\sqrt{\log n \log\log n})}$. We construct
the vectors $v^{(m)}_i$ inductively in a bottom-up manner. We use vectors for small $m$ to build vectors
for large $m$. $W$ is exactly the set of lengths $m$ that we need in the process.
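
The closure rule that defines $W$ can be made concrete with a small sketch. The function name `build_lengths` is hypothetical, and two simplifications are assumed that the text does not spell out: $i/b$ is rounded down to an integer, and $j$ ranges over non-negative integers.

```python
def build_lengths(n: int, b: int) -> list[int]:
    """Return the sorted set W of substring lengths used by the recursion.

    W is the smallest set containing n such that, for every i in W with
    i >= b, the lengths (i // b) - 2**j + 1 are in W for all j >= 0
    with 2**j <= i // b. Requires b >= 2 so that lengths shrink.
    """
    assert n >= 1 and b >= 2
    lengths = {n}
    stack = [n]
    while stack:
        i = stack.pop()
        if i < b:
            continue  # short lengths are handled directly (base case)
        block = i // b  # block length l = i / b, rounded down (assumption)
        s = 1
        while s <= block:  # s runs over powers of two up to l
            m = block - s + 1
            if m not in lengths:
                lengths.add(m)
                stack.append(m)
            s *= 2
    return sorted(lengths)
```

For example, `build_lengths(64, 4)` closes `{64}` under the rule and yields the lengths 1, 2, 3, 4, 9, 13, 15, 16 and 64.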

Fix an $m \in W$ such that $m \le b^2 = 2^{2\sqrt{\log n \log\log n}}$. We define the vector $v^{(m)}_i$ to be equal to
$h_m(x[i : i + m - 1])$, where $h_m : \{0,1\}^m \to \{0,1\}^{\alpha}$ is a randomly chosen function. It is readily seen
that $\|v^{(m)}_i - v^{(m)}_j\|_1$ approximates $\mathrm{ed}(x[i : i + m - 1], x[j : j + m - 1])$ up to $b^2 = 2^{2\sqrt{\log n \log\log n}}$
approximation factor, for each $i, j \in [n - m + 1]$.

Now consider $m \in W$ such that $m > b^2$. Let $l = m/b$. First we construct vectors approximating
TEMD on sets $A^{l,s}_i = \{ v^{(l-s+1)}_{i+z} \mid z = 0, \dots, s-1 \}$, where $s = 1, 2, 4, 8, \dots, l$ and $i \in [n - l + s]$.
In particular, for a fixed $s \in [l]$ equal to a power of 2, we apply Theorem 3.1 to the set of vectors
$\left(v^{(l-s+1)}_i\right)_{i \in [n-l+s]}$, obtaining vectors $\left(q^{(l,s)}_i\right)_{i \in [n-l+1]}$. Theorem 3.1 guarantees that, for each
$i, j \in [n - l + 1]$, the value $\|q^{(l,s)}_i - q^{(l,s)}_j\|_1$ approximates $\mathrm{TEMD}(A^{l,s}_i, A^{l,s}_j)$ up to a factor of $\log^{O(1)} n$. We
can then use these vectors $q^{(l,s)}_i$ to obtain the vectors $v^{(m)}_i \in \ell_1^{\alpha}$ that approximate the "idealized"
distance $d^*_m$ on substrings $x[i : i + m - 1]$, for $i \in [n - m + 1]$. Specifically, we let the vector $v^{(m)}_i$
be a concatenation of vectors $q^{(l,s)}_{i+(j-1)l}$, where $j \in [b]$, and $s$ goes over all powers of 2 less than $l$:

$$v^{(m)}_i = \left( q^{(l,s)}_{i+(j-1)l} \right)_{j \in [b],\ s < l,\ s \in \{1, 2, 4, 8, \dots\}}.$$
Then, the vectors $v^{(m)}_i$ approximate the distance $d^*_m$ (given in Equation (2)) up to a $\log^{O(1)} n$
approximation factor, with the sets $S^s_j(x[i : i + m - 1])$ taken as

$$S^s_j(x[i : i + m - 1]) = A^{l,s}_{i+(j-1)l} = \{ v^{(l-s+1)}_{i+(j-1)l+z} \mid z = 0, \dots, s-1 \},$$

for $i \in [n - m + 1]$ and $j \in [b]$.

The algorithm finishes by outputting $\|v^{(n)}_1 - v^{(n)}_{n+1}\|_1$, which is an approximation to the edit
distance between $x[1 : n]$ and $x[n+1 : 2n] = y$. The total running time is
$O(|W| \cdot n \cdot b^{O(1)} \cdot \log^{O(1)} n) = n \cdot 2^{O(\sqrt{\log n \log\log n})}$.

It remains to analyze the resulting approximation. Let $D_m$ be the approximation achieved by
vectors $v^{(k)}_i \in \ell_1$ for substrings of $x$ of lengths $k$, where $k \in W$ and $k \le m$. Then, using Fact 2.1
and the fact that vectors $v^{(m)}_i \in \ell_1$ approximate $d^*_m$, we have that

$$D_m \le \log^{O(1)} n \cdot \left( D_{m/b} + 2^{\sqrt{\log n \log\log n}} \right).$$

Since the total number of recursion levels is bounded by $\log_b n = \sqrt{\log n / \log\log n}$, we deduce that
$D_n = 2^{O(\sqrt{\log n \log\log n})}$.

3.1 Proof of Theorem 3.1



The proof proceeds in two stages. In the first stage we show an embedding of the TEMD metric
into a low-dimensional space. Specifically, we show an (oblivious) embedding of TEMD into a
min-product of $\ell_1$. Recall that the min-product of $\ell_1$, denoted $\bigoplus_{\min}^{l} \ell_1^{k}$, is a semi-metric where the
distance between two $l$-by-$k$ vectors $x, y \in \mathbb{R}^{l \times k}$ is $d_{\min,1}(x, y) = \min_{i \in [l]} \left\{ \sum_{j \in [k]} |x_{i,j} - y_{i,j}| \right\}$. Our
min-product of $\ell_1$'s has dimensions $l = O(\log n)$ and $k = O(\log^3 n)$. The min-product can be seen
as helping us on two fronts: one is the embedding of TEMD into $\ell_1$ (of initially high dimension),
and another is a weak dimensionality reduction in $\ell_1$, using Cauchy projections. Both of these
embeddings are of the following form: consider a randomized embedding $f$ into (standard) $\ell_1$ that
has no contraction (w.h.p.) but the expansion is bounded only in the expectation (as opposed to
w.h.p.). To obtain a "w.h.p." expansion, one standard approach is to sample $f$ many times and
concentrate the expectation. This approach, however, will necessitate a high number of samples of
$f$, and thus yield a high final dimension. Instead, the min-product allows us to take only $O(\log n)$
independent samples of $f$.

We note that our embedding of TEMD into min-product of $\ell_1$, denoted $\lambda$, is linear in the sets $A$:
$\lambda(A) = \sum_{a \in A} \lambda(\{a\})$. The linearity allows us to compute the embedding of sets $A_i$ in a streaming
fashion: the embedding of $A_{i+1}$ is obtained from the embedding of $A_i$ with $\log^{O(1)} n$ additional
processing. This stage appears in Section 3.1.1.

In the second stage, we show that, given a set of $n$ points in min-product of $\ell_1$'s, we can
(non-obliviously) embed these points into low-dimensional $\ell_1$ with $O(\log n)$ distortion. The time required
is near-linear in $n$ and the dimensions of the min-product of $\ell_1$'s.

To accomplish this step, we start by embedding the min-product of $\ell_1$'s into a min-product of
tree metrics. Next, we show that $n$ points in the low-dimensional min-product of tree metrics can
be embedded into a graph metric supported by a sparse graph. We note that this is in general
not possible, with any (even non-constant) distortion. We show that this is possible when we
assume that our subset of the min-product of tree metrics approximates some actual metric (in
our case, the min-product approximates the TEMD metric). Finally, we observe that we can
implement Bourgain's embedding in near-linear time on a sparse graph metric. This stage appears
in Section 3.1.2.



We conclude with the proof of Theorem 3.1 in Section 3.1.3.



3.1.1 Embedding EMD into min-product of $\ell_1$

In the next lemma, we show how to embed TEMD into a min-product of $\ell_1$'s of low dimension.
Moreover, when the sets $A_i$ are obtained from a sequence of vectors $v_1, \dots, v_n$, by taking
$A_i = \{v_i, \dots, v_{i+s-1}\}$, we can compute the embedding in near-linear time.

Lemma 3.2. Fix $n, M \in \mathbb{N}$ and $s \in [n]$. Suppose we have $n$ vectors $v_1, \dots, v_n$ in
$\{-M, -M+1, \dots, M\}^{\alpha}$ for some $\alpha \le n$. Consider the sets $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$, for $i \in [n - s + 1]$.

Let $k = O(\log^3 n)$. We can compute (randomized) vectors $q_i \in \ell_1^{k}$ for $i \in [n - s + 1]$ such that,
for any $i, j \in [n - s + 1]$ we have that

• $\Pr\left[ \|q_i - q_j\|_1 \le \mathrm{TEMD}(A_i, A_j) \cdot O(\log^2 n) \right] \ge 0.1$ and

• $\|q_i - q_j\|_1 \ge \mathrm{TEMD}(A_i, A_j)$ w.h.p.

The computation time is $\tilde O(n\alpha)$.

Thus, we can embed the TEMD metric over sets $A_i$ into $\bigoplus_{\min}^{l} \ell_1^{k}$, for $l = O(\log n)$, such that
the distortion is $O(\log^2 n)$ w.h.p. The computation time is $\tilde O(n\alpha)$.

Proof. First, we show how to embed TEMD metric over the sets $A_i$ into $\ell_1$ of dimension
$M^{O(\alpha)} \cdot O(\log n)$. For this purpose, we use a slight modification of the embedding of [AIK08] (it can also
be seen as a strengthening of the TEMD embedding of Ostrovsky and Rabani).

The embedding of [AIK08] constructs $m = O(\log s)$ embeddings $\psi_i$, each of dimension
$h = M^{O(\alpha)}$, and then the final embedding is just the concatenation $\psi = \psi_1 \circ \psi_2 \circ \dots \circ \psi_m$. For $i = 1, \dots, m$,
we impose a randomly shifted grid of side-length $R_i = 2^{i-2}$. That is, let $\Delta_i = (\delta_{i,1}, \dots, \delta_{i,\alpha})$ be
selected uniformly at random from $[0, 1)^{\alpha}$. A specific vector $v_j$ falls into the cell $(c_1, \dots, c_\alpha)$, where
$c_t = \lfloor v_{j,t}/R_i + \delta_{i,t} \rfloor$ for $t = 1, \dots, \alpha$. Then $\psi_i$ has a coordinate for each cell $(c_1, \dots, c_\alpha)$, where
each $c_t$, for $t = 1, \dots, \alpha$, is at most $2M/R_i + 1$ in absolute value. These are the only cells that can be non-empty, and there
is at most $(2M/R_i + 1)^{\alpha} = M^{O(\alpha)}$ of them. The value of a specific coordinate, for a set $A$, equals
the number of vectors from $A$ falling into the corresponding cell times $R_i$. Now, if we scale $\psi$ up
by a factor of $\Theta(\frac{1}{s} \log n)$, Theorem 3.1 from [AIK08]^5 says that the vectors $q'_i = \psi(A_i)$ satisfy the
condition that, for any $i, j \in [n - s + 1]$, we have:



• $\mathbb{E}\left[ \|q'_i - q'_j\|_1 \right] \le \mathrm{TEMD}(A_i, A_j) \cdot O(\log^2 n)$, and

• $\|q'_i - q'_j\|_1 \ge \mathrm{TEMD}(A_i, A_j)$ w.h.p.

Thus, the vectors $q'_i$ satisfy the promised properties except they have a high dimension.

To reduce the dimension of $q'_i$'s, we apply a weak $\ell_1$ dimensionality reduction via 1-stable
(Cauchy) projections. Namely, we pick a random matrix $P$ of size $k = O(\log^3 n)$ by $mh = O(\log s) \cdot M^{O(\alpha)}$,
the dimension of $\psi$, where each entry is distributed according to the Cauchy distribution,
which has probability density function $f(x) = \frac{1}{\pi} \cdot \frac{1}{1+x^2}$. Now define $q_i = P \cdot q'_i \in \ell_1^{k}$. Standard
properties of the dimensionality reduction guarantee that the vectors $q_i$ satisfy the properties
promised in the lemma statement, after an appropriate rescaling (see Theorem 5 of [Ind06] with
$\epsilon = 1/2$, $\gamma = 1/6$, and $\delta = n^{-O(1)}$).

^5 Note that Theorem 3.1 from [AIK08] is stated for EMD, and here we are concerned with TEMD. Nevertheless,
the whole statement still applies, because the side of the largest grid is bounded by $O(s)$.

It remains to show that we can compute the vectors $q_i$ in $\tilde O(n\alpha)$ time. To this end, observe
that the resulting embedding $P \cdot \psi(A)$ is linear, namely $P \cdot \psi(A) = \sum_{a \in A} P \cdot \psi(\{a\})$. Moreover,
each $P \cdot \psi(\{v_i\})$ can be computed in $\alpha \cdot \log^{O(1)} n$ time, because $\psi(\{v_i\})$ has exactly one non-zero
coordinate, which can be computed in $O(\alpha)$ time, and then $P \cdot \psi(\{v_i\})$ is simply the corresponding
column of $P$ multiplied by the non-empty coordinate of $\psi(\{v_i\})$. To obtain the first vector $q_1$, we
compute the summation of all corresponding $P \cdot \psi(\{v_i\})$. To compute the remaining vectors $q_i$
iteratively, we use the idea of a sliding window over the sequence $v_1, \dots, v_n$. Specifically, we have

$$q_{i+1} = P \cdot \psi(A_{i+1}) = P \cdot \psi(A_i \cup \{v_{i+s}\} \setminus \{v_i\}) = q_i + P \cdot \psi(\{v_{i+s}\}) - P \cdot \psi(\{v_i\}),$$

which implies that $q_{i+1}$ can be computed in $\alpha \cdot \log^{O(1)} n$ time, given the value of $q_i$. Therefore, the
total time required to compute all $q_i$'s is $O(n\alpha \cdot \log^{O(1)} n)$.
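
The sliding-window update relies only on linearity, so it can be sketched independently of the concrete embedding. In the sketch below, `phi` is a hypothetical stand-in for the map $v \mapsto P \cdot \psi(\{v\})$ (any function returning a fixed-length list of numbers), and `sliding_sketches` is a name introduced here for illustration.

```python
def sliding_sketches(vectors, s, phi):
    """Compute q_i = sum of phi(v) over the window v_i, ..., v_{i+s-1}.

    Each step adds the image of the element entering the window and
    subtracts the image of the element leaving it, so all n - s + 1
    sketches cost one pass over the sequence instead of n * s work.
    """
    n = len(vectors)
    assert 1 <= s <= n
    images = [phi(v) for v in vectors]  # one evaluation of phi per input vector
    q = [sum(column) for column in zip(*images[:s])]  # sketch of the first window
    sketches = [list(q)]
    for i in range(n - s):
        entering, leaving = images[i + s], images[i]
        q = [c + a - r for c, a, r in zip(q, entering, leaving)]
        sketches.append(q)
    return sketches
```

With `phi = lambda v: [v, v * v]`, the input `[1, 2, 3, 4]` and `s = 2`, the three windows produce `[3, 5]`, `[5, 13]` and `[7, 25]`, the same values a direct recomputation per window would give.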

Finally, we show how we obtain an efficient embedding of TEMD into min-product of $\ell_1$'s. We
apply the above procedure $l = O(\log n)$ times. Let $q^{(z)}_i$ be the resulting vectors, for $i \in [n - s + 1]$
and $z \in [l]$. The embedding of a set $A_i$ is the concatenation of the vectors $q^{(z)}_i$, namely
$Q_i = (q^{(1)}_i, q^{(2)}_i, \dots, q^{(l)}_i) \in \bigoplus_{\min}^{l} \ell_1^{k}$. The Chernoff bound implies that w.h.p., for any $i, j \in [n - s + 1]$, we
have that

$$d_{\min,1}(Q_i, Q_j) = \min_{z \in [l]} \|q^{(z)}_i - q^{(z)}_j\|_1 \le \mathrm{TEMD}_s(A_i, A_j) \cdot O(\log^2 n).$$

Also, $d_{\min,1}(Q_i, Q_j) \ge \mathrm{TEMD}_s(A_i, A_j)$ w.h.p. trivially. Thus the vectors $Q_i$ are an embedding of
the TEMD metric on $A_i$'s into $\bigoplus_{\min}^{l} \ell_1^{k}$ with distortion $O(\log^2 n)$ w.h.p. □
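
The amplification behind the min-product can be quantified with a short calculation. The sketch below is a back-of-the-envelope aid rather than part of the proof: it assumes the $l$ repetitions are independent and that each one satisfies the expansion bound with probability at least `success` (0.1 in Lemma 3.2), so the minimum over $l$ repetitions fails only if all of them fail. The function name and the target failure probability $n^{-c}$ are choices made here.

```python
import math


def samples_needed(n: int, c: float = 2.0, success: float = 0.1) -> int:
    """Smallest l with (1 - success)**l <= n**(-c).

    If each independent repetition obeys the expansion bound with
    probability >= success, then after l repetitions the minimum
    violates it with probability at most (1 - success)**l.
    """
    assert n >= 2 and 0.0 < success < 1.0
    return math.ceil(c * math.log(n) / -math.log(1.0 - success))
```

Since the result grows like $c \ln n / \ln(1/0.9)$, a number of repetitions $l = O(\log n)$ suffices, in line with the dimension $l$ of the min-product.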



3.1.2 Embedding of min-product of $\ell_1$ into low-dimensional $\ell_1$

In this section, we show that $n$ points $Q_1, \dots, Q_n$ in the semi-metric space $\bigoplus_{\min}^{l} \ell_1^{k}$ can be embedded
into $\ell_1$ of dimension $O(\log^2 n)$ with distortion $\log^{O(1)} n$. The embedding works under the assumption
that the semi-metric on $Q_1, \dots, Q_n$ is a $\log^{O(1)} n$ approximation of some metric. We start by showing
that we can embed a min-product of $\ell_1$'s into a min-product of tree metrics.

Lemma 3.3. Fix $n, M \in \mathbb{N}$ such that $M = n^{O(1)}$. Consider $n$ vectors $v_1, \dots, v_n$ in $\bigoplus_{\min}^{l} \ell_1^{k}$, for
some $l, k \in \mathbb{N}$, where each coordinate of each $v_i$ lies in the set $\{-M, \dots, M\}$. We can embed these
vectors into a min-product of $O(l \cdot \log^2 n)$ tree metrics, i.e., $\bigoplus_{\min}^{O(l \log^2 n)} \mathrm{TM}$, incurring distortion
$O(\log n)$ w.h.p. The computation time is $\tilde O(n \cdot kl)$.

Proof. We consider all thresholds $2^t$, for $t \in \{0, 1, \ldots, \log M\}$. For each threshold $2^t$, and for
each coordinate of the min-product (i.e., $\ell_1^k$), we create $O(\log n)$ tree metrics. Each tree metric is
independently created as follows. We again use randomly shifted grids. Specifically, we define a
hash function $h : \ell_1^k \to \mathbb{Z}^k$ as

$$h(x_1, \ldots, x_k) = \left( \left\lfloor \frac{x_1 + u_1}{2^t} \right\rfloor, \left\lfloor \frac{x_2 + u_2}{2^t} \right\rfloor, \ldots, \left\lfloor \frac{x_k + u_k}{2^t} \right\rfloor \right),$$

where each $u_i$ is chosen at random from $[0, 2^t)$. We create each tree metric so that the nodes
corresponding to the points hashed by $h$ to the same value are at distance $2^t$ (this creates a set of
stars), and each pair of points that are hashed to different values are at distance $2Mk$ (we connect
the roots of the stars).

For two points $x, y \in \ell_1^k$, the probability that they are separated by the grid in the $i$-th dimension
is at most $|x_i - y_i|/2^t$, which implies by the union bound that

$$\Pr[h(x) = h(y)] \ge 1 - \sum_i \frac{|x_i - y_i|}{2^t} = 1 - \frac{\|x - y\|_1}{2^t}.$$

On the other hand, the probability that $x$ and $y$ are not separated by the grid in the $i$-th dimension
is $\max\{1 - |x_i - y_i|/2^t, 0\} \le e^{-|x_i - y_i|/2^t}$. Since the grid is shifted independently in each dimension,

$$\Pr[h(x) = h(y)] \le \prod_{i=1}^{k} e^{-|x_i - y_i|/2^t} = e^{-\sum_{i=1}^{k} |x_i - y_i|/2^t} = e^{-\|x - y\|_1/2^t}.$$

By the Chernoff bound, if $x, y \in \ell_1^k$ are at distance at most $2^t$ for some $t$, they will be at distance
at most $2^{t+1}$ in one of the tree metrics with high probability. On the other hand, let $v_i$ and $v_j$ be
two input vectors at distance greater than $2^t$. The probability that they are at distance smaller
than $2^t/(c \log n)$ in any of the $O(\log^2 n)$ tree metrics is at most $n^{-\Omega(c)}$ for any $c > 0$, by the union
bound.

Therefore, we multiply the weights of all edges in all trees by $O(\log n)$ to achieve a proper
(non-contracting) embedding. □
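As an illustration of the hash function $h$ and of the two collision bounds used in the proof, the following Python sketch draws randomly shifted grids and estimates $\Pr[h(x) = h(y)]$ empirically. The helper names (`make_grid_hash`, `collision_rate`) and the sample points are assumptions made for this example only.

```python
import math
import random


def make_grid_hash(k, t, rng):
    """Randomly shifted grid with cell width 2**t in k dimensions."""
    width = 2 ** t
    # Each shift u_i is drawn uniformly from [0, 2**t).
    shifts = [rng.random() * width for _ in range(k)]

    def h(x):
        return tuple(math.floor((xi + ui) / width) for xi, ui in zip(x, shifts))

    return h


def collision_rate(x, y, t, trials, rng):
    """Fraction of independent grids in which x and y fall into the same cell."""
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(x), t, rng)
        hits += h(x) == h(y)
    return hits / trials


def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


if __name__ == "__main__":
    rng = random.Random(0)
    x, y, t = (0, 0, 0), (1, 2, 1), 4
    lower = 1 - l1(x, y) / 2 ** t          # union-bound lower bound
    upper = math.exp(-l1(x, y) / 2 ** t)   # independence upper bound
    print(lower, collision_rate(x, y, t, 20000, rng), upper)
```

For these sample points the exact collision probability is $(1 - 1/16)(1 - 2/16)(1 - 1/16) \approx 0.769$, which lies between the lower bound $0.75$ and the upper bound $e^{-1/4} \approx 0.779$.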

We now show that we can embed a subset of the min-product of tree metrics into a graph 
metric, assuming the subset is close to a metric. 

Lemma 3.4. Consider a semi-metric $\mathcal{M} = (X, \xi)$ of size $n$ in $\bigoplus_{\min}^{l} \mathrm{TM}$ for some $l \in \mathbb{N}$, where
each tree metric in the product is of size $O(n)$. Suppose $\mathcal{M}$ is a $\gamma$-near metric (i.e., it is embeddable
into a metric with $\gamma$ distortion). Then we can embed $\mathcal{M}$ into a connected weighted graph with $O(nl)$
edges with distortion $\gamma$ in $O(nl)$ time.

Proof. We consider $l$ separate trees each on $O(n)$ nodes, corresponding to each of $l$ dimensions of the
min-product. We identify the nodes of trees that correspond to the same point in the min-product,
and collapse them into a single node. The graph we obtain has at most $O(nl)$ edges. Denote
the shortest-path metric it spans with $\mathcal{M}' = (V, \rho)$, and denote our embedding with $\phi : X \to V$.
Clearly, for each pair $u, v$ of points in $X$, we have $\rho(\phi(u), \phi(v)) \le \xi(u, v)$. If the distance between
two points shrinks after embedding, then there is a sequence of points $w_0 = u, w_1, \ldots, w_{k-1},
w_k = v$ such that $\rho(\phi(u), \phi(v)) = \xi(w_0, w_1) + \xi(w_1, w_2) + \cdots + \xi(w_{k-1}, w_k)$. Because $\mathcal{M}$ is a $\gamma$-near
metric, there exists a metric $\xi^* : X \times X \to [0, \infty)$, such that $\xi^*(x, y) \le \xi(x, y) \le \gamma \cdot \xi^*(x, y)$, for all
$x, y \in X$. Therefore,

$$\rho(\phi(u), \phi(v)) = \sum_{i=0}^{k-1} \xi(w_i, w_{i+1}) \ge \sum_{i=0}^{k-1} \xi^*(w_i, w_{i+1}) \ge \xi^*(w_0, w_k) = \xi^*(u, v) \ge \xi(u, v)/\gamma.$$

Hence, it suffices to multiply all edge weights of the graph by $\gamma$ to achieve a non-contractive
embedding. Since there was no expansion before, it is now bounded by $\gamma$. □
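The gluing step of this proof can be sketched in Python on a toy instance. The two star-shaped trees, the node names, and the helper functions below are hypothetical and chosen only to show that the shortest-path distance in the merged graph never exceeds the min-product distance, and that it can be strictly smaller.

```python
import heapq


def shortest_paths(edges, source):
    """Dijkstra over an undirected weighted edge list; returns a dict of distances."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def min_product_distance(trees, u, v):
    """xi(u, v): the smallest tree distance between u and v over all trees."""
    return min(shortest_paths(t, u)[v] for t in trees)


def merged_graph_distance(trees, u, v):
    """rho(u, v): shortest path after gluing the trees at shared point names."""
    merged = [e for t in trees for e in t]
    return shortest_paths(merged, u)[v]


# Points p0, p1, p2 appear in both trees; c1 and c2 are tree-internal nodes.
tree1 = [("p0", "c1", 1), ("p1", "c1", 1), ("p2", "c1", 10)]
tree2 = [("p0", "c2", 10), ("p1", "c2", 1), ("p2", "c2", 1)]
trees = [tree1, tree2]
```

In this instance the min-product distance between `p0` and `p2` is 11, while the merged graph connects them through `p1` at distance 4. The contraction factor $11/4$ is consistent with the lemma, since any metric lying below this min-product must place `p0` and `p2` at distance at most 4.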
We now show how to embed the shortest-path metric of a graph into a low dimensional $\ell_1$-space
in time near-linear in the graph size. For this purpose, we implement Bourgain's embedding [Bou85]
in near-linear time. We use the following version of Bourgain's embedding, which follows from the
analysis in [Mat02].

Lemma 3.5 (Bourgain's embedding [Mat02]). Let $\mathcal{M} = (X, \rho)$ be a finite metric on $n$ points.
There is an algorithm that computes an embedding $f : X \to \ell_1^t$ of $\mathcal{M}$ into $\ell_1^t$ for $t = O(\log^2 n)$ such
that, with high probability, for each $u, v \in X$, we have $\rho(u, v) \le \|f(u) - f(v)\|_1 \le \rho(u, v) \cdot O(\log n)$.

Specifically, for coordinate $i \in [t]$ of $f$, the embedding associates a nonempty set $A_i \subseteq X$ such
that $f(u)_i = \rho(u, A_i) = \min_{a \in A_i} \rho(u, a)$. Each $A_i$ is samplable in linear time.

The running time of the algorithm is $O(g(n) \cdot \log^2 n)$, where $g(n)$ is the time necessary to compute
the distance of all points to a given fixed subset of points.

Lemma 3.6. Consider a connected graph $G = (V, E)$ on $n$ nodes with $m$ edges and a weight
function $w : E \to [0, \infty)$. There is a randomized algorithm that embeds the shortest path metric of
$G$ into $\ell_1^{O(\log^2 n)}$ with $O(\log n)$ distortion, with high probability, in $O(m \log^3 n)$ time.

Proof. Let $f : V \to \ell_1^{O(\log^2 n)}$ be the embedding given by Lemma 3.5. For any nonempty subset
$A \subseteq V$, we can compute $\rho(v, A)$ for all $v \in V$ by Dijkstra's algorithm in $O(m \log n)$ time. The total
running time is thus $O(m \log^3 n)$. □
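The key primitive in this proof, computing $\rho(v, A)$ for all $v$ with a single Dijkstra run started from every node of $A$ at once, can be sketched in Python as follows. The set-sampling scheme in `bourgain_style_coordinates` (each node kept with probability $2^{-j}$) is the usual one from presentations of Bourgain's embedding and is an assumption here, as are the function names; the coordinates are left unnormalized.

```python
import heapq
import math
import random


def distances_to_set(adj, sources):
    """rho(v, A) for every node v, via one multi-source Dijkstra run.

    adj maps each node to a list of (neighbor, nonnegative weight) pairs.
    """
    dist = {v: float("inf") for v in adj}
    heap = []
    for a in sources:
        dist[a] = 0.0
        heap.append((0.0, a))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def bourgain_style_coordinates(adj, rng, reps):
    """One coordinate rho(v, A) per sampled set A; about log2(n) * reps sets."""
    nodes = sorted(adj)
    levels = max(1, math.ceil(math.log2(len(nodes))))
    coords = {v: [] for v in nodes}
    for j in range(1, levels + 1):
        for _ in range(reps):
            sample = [v for v in nodes if rng.random() < 2.0 ** -j]
            if not sample:
                sample = [rng.choice(nodes)]  # keep the set nonempty
            dist = distances_to_set(adj, sample)
            for v in nodes:
                coords[v].append(dist[v])
    return coords
```

Every coordinate is 1-Lipschitz with respect to the shortest-path metric, because $|\rho(u, A) - \rho(v, A)| \le \rho(u, v)$ by the triangle inequality.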

3.1.3 Finalization of the proof of Theorem 3.1

We first apply Lemma 3.2 to embed the sets $A_i$ into $\bigoplus_{\min}^{O(\log n)} \ell_1^k$ with distortion at most $O(\log^2 n)$
with high probability, where $k = O(\log^3 n)$. We write $v_i$, $i \in [n - s + 1]$, to denote the embedding
of $A_i$. Note that the TEMD distance between two different $A_i$'s is at least $1/s \ge 1/n$, and so is
the distance between two different $v_i$'s. We multiply all coordinates of $v_i$'s by $2kn = \tilde{O}(n)$ and
round them to the nearest integer. This way we obtain vectors $v'_i$ with integer coordinates in
$\{-2knM - 1, \ldots, 2knM + 1\}$. Consider two vectors $v_i$ and $v_j$. Let $D$ be their distance, and let
$D'$ be the distance between the corresponding $v'_i$ and $v'_j$. We claim that $knD \le D' \le 3knD$, and
it suffices to show this claim for $v_i \ne v_j$, in which case we know that $D \ge 1/n$. Each coordinate
of the min-product is $\ell_1^k$, and we know that in each of the coordinates the distance is at least $D$.
Consider a given coordinate of the min-product, and let $d$ and $d'$ be the distance before and after
the scaling and rounding, respectively. On the one hand,

$$\frac{d'}{d} \ge \frac{2knd - k}{d} \ge 2kn - \frac{k}{D} \ge 2kn - kn = kn,$$

and on the other,

$$\frac{d'}{d} \le \frac{2knd + k}{d} \le 2kn + \frac{k}{D} \le 2kn + kn = 3kn.$$



Therefore, in each coordinate, the distance gets scaled by a factor in the range $[kn, 3kn]$. We now
apply Lemma 3.3 to $v'_i$'s and obtain their embedding into a min-product of tree metrics. Then, we
divide all distances in the trees by $kn$, and achieve an embedding of $v_i$'s into a min-product of trees
with distortion at most 3 times larger than that implied by Lemma 3.3, which is $O(\log n)$.

The resulting min-product of tree metrics need not be a metric, but it is a $\gamma$-near metric, where
$\gamma = O(\log^3 n)$ is the expansion incurred so far. We therefore embed the min-product of tree metrics
into the shortest-path metric of a weighted graph by using Lemma 3.4 with expansion at most $\gamma$.
Finally, we embed this metric into a low dimensional $\ell_1$ metric space with distortion $O(\log n)$ by
using Lemma 3.6.
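The scaling-and-rounding claim, namely that distances of at least $1/n$ within one $\ell_1^k$ coordinate are scaled by a factor in $[kn, 3kn]$, can be checked numerically with a short Python sketch. The random test vectors, the parameter values, and the function names are assumptions made for this illustration.

```python
import random


def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def scale_and_round(v, k, n):
    """Multiply every coordinate by 2kn and round to the nearest integer."""
    return [round(2 * k * n * c) for c in v]


def observed_ratio_range(k, n, trials, rng):
    """Smallest and largest d'/d over random pairs in l_1^k with d >= 1/n."""
    lo, hi = float("inf"), 0.0
    for _ in range(trials):
        x = [rng.uniform(-0.1, 0.1) for _ in range(k)]
        y = [rng.uniform(-0.1, 0.1) for _ in range(k)]
        d = l1(x, y)
        if d < 1 / n:
            continue  # the claim is only made for distances of at least 1/n
        d_scaled = l1(scale_and_round(x, k, n), scale_and_round(y, k, n))
        lo, hi = min(lo, d_scaled / d), max(hi, d_scaled / d)
    return lo, hi
```

Rounding moves each coordinate by at most $1/2$, so the $\ell_1$ distance of a pair changes by at most $k$, which is the additive term $\pm k$ appearing in the two inequalities.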



4 Applications 

We now present two applications mentioned in the introduction: sublinear-time approximation of 
edit distance, and approximate pattern matching under edit distance. 



4.1 Sublinear-time approximation 

We now present a sublinear-time algorithm for distinguishing pairs of strings with small edit
distance from pairs with large edit distance. Let $x$ and $y$ be the two strings. The algorithm partitions
them into blocks $x_i$ and $y_i$ of the same length such that $x = x_1 \ldots x_b$ and $y = y_1 \ldots y_b$. Then it
selects a few random $i$, and for each of them, it compares $x_i$ to $y_i$. If it finds an $i$ for which $x_i$ and
$y_i$ are very different, the distance between $x$ and $y$ is likely to be large. Otherwise, if no such $i$ is
detected, the edit distance between $x$ and $y$ is likely to be small. Our edit distance algorithm is
used for approximating the distance between specific $x_i$ and $y_i$.

Theorem 4.1. Let $\alpha$ and $\beta$ be two constants such that $0 \le \alpha < \beta \le 1$. There is an algorithm
that distinguishes pairs of strings with edit distance $O(n^\alpha)$ from those with distance $\Omega(n^\beta)$ in time
$n^{\alpha + 2(1 - \beta) + o(1)}$.



Proof. Let $f(n) = 2^{O(\sqrt{\log n \log\log n})}$ be a non-decreasing function that bounds the approximation
factor of the algorithm given by Theorem 1.1. Let $b = \frac{n^{\beta - \alpha}}{f(n) \cdot \log n}$. We partition the input strings $x$
and $y$ into $b$ blocks, denoted $x_i$ and $y_i$ for $i \in [b]$, of length $n/b$ each.

If $\mathrm{ed}(x, y) = O(n^\alpha)$, then $\max_i \mathrm{ed}(x_i, y_i) \le \mathrm{ed}(x, y) = O(n^\alpha)$. On the other hand, if $\mathrm{ed}(x, y) =
\Omega(n^\beta)$, then $\max_i \mathrm{ed}(x_i, y_i) \ge \mathrm{ed}(x, y)/b = \Omega(n^\alpha \cdot f(n) \cdot \log n)$. Moreover, the number of blocks $i$
such that $\mathrm{ed}(x_i, y_i) \ge \mathrm{ed}(x, y)/2b = \Omega(n^\alpha \cdot f(n) \cdot \log n)$ is at least

$$\frac{\mathrm{ed}(x, y) - b \cdot \mathrm{ed}(x, y)/2b}{n/b} = \Omega(n^{\beta - 1} \cdot b).$$

Therefore, we can tell the two cases apart with constant probability by sampling $O(n^{1 - \beta})$ pairs
of blocks $(x_i, y_i)$ and checking if any of the pairs is at distance $\Omega(n^\alpha \cdot f(n) \cdot \log n)$. Since for
each such pair of strings, we only have to tell edit distance $O(n^\alpha)$ from $\Omega(n^\alpha \cdot f(n) \cdot \log n)$, we
can use the algorithm of Theorem 1.1. We amplify the probability of success of that algorithm
in the standard way by running it $O(\log n)$ times. The total running time of the algorithm is
$O(n^{1 - \beta}) \cdot O(\log n) \cdot (n/b)^{1 + o(1)} = O(n^{\alpha + 2(1 - \beta) + o(1)})$. □
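The block-sampling test in this proof can be sketched in Python as follows. The sketch substitutes the textbook quadratic dynamic program for the near-linear-time algorithm of Theorem 1.1, so it only illustrates the sampling logic and not the running time; the function names and the parameters `samples` and `threshold` are assumptions made for the example.

```python
import random


def edit_distance(a, b):
    """Textbook dynamic program, used here as a stand-in for Theorem 1.1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def looks_far(x, y, b, samples, threshold, rng):
    """Sample block pairs; report True if some sampled pair is at distance >= threshold."""
    assert len(x) == len(y) and len(x) % b == 0
    size = len(x) // b
    for _ in range(samples):
        i = rng.randrange(b)
        block_x = x[i * size:(i + 1) * size]
        block_y = y[i * size:(i + 1) * size]
        if edit_distance(block_x, block_y) >= threshold:
            return True
    return False
```

In the setting of the theorem, `samples` would be on the order of $n^{1-\beta}$ and `threshold` on the order of $n^\alpha \cdot f(n) \cdot \log n$.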



4.2 Pattern matching 

Our algorithm can be used for approximating the edit distance between a pattern $P$ of length $n$
and all length-$n$ substrings of a text $T$. Let $N = |T|$. For every $s \in [N - 2n + 1]$ of the form
$in + 1$, we concatenate $T$'s length-$2n$ substring that starts at index $s$ with $P$, and compute an
embedding of edit distance between all length-$n$ substrings of the newly created string into $\ell_1^\alpha$ for
$\alpha = 2^{O(\sqrt{\log n \log\log n})}$. We routinely amplify the probability of success of each execution of the
algorithm by running it $O(\log N)$ times and selecting the median of the returned values. The
running time of the algorithm is $O(N \log N) \cdot 2^{O(\sqrt{\log n \log\log n})}$.

The distance between each of the substrings and the pattern is approximate up to a factor
of $2^{O(\sqrt{\log n \log\log n})}$ and can be used both for finding approximate occurrences of $P$ in $T$, and for
finding a substring of $T$ that is approximately closest to $P$.
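The way the length-$2n$ windows cover the length-$n$ substrings of $T$ can be sketched in Python. The sketch uses 0-indexed window starts $in$ (corresponding to $s = in + 1$ in the text) and, as an assumption of ours that the text does not spell out, adds one extra window at the end of $T$ when $N$ is not a multiple of $n$.

```python
def window_starts(N, n):
    """0-indexed starts of the length-2n windows of a length-N text (N >= 2n)."""
    starts = list(range(0, N - 2 * n + 1, n))
    if starts[-1] + 2 * n < N:
        starts.append(N - 2 * n)  # assumption: extra window covering the tail
    return starts


def covered_substring_starts(N, n):
    """Starts of all length-n substrings that lie inside some window."""
    covered = set()
    for s in window_starts(N, n):
        # The window [s, s + 2n) contains the length-n substrings starting at s, ..., s + n.
        covered.update(range(s, s + n + 1))
    return covered
```

Consecutive windows overlap in $n$ positions, so every length-$n$ substring lies entirely inside at least one window, and the number of windows is $O(N/n)$.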